This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

Notes:

  1. If you are using this document in .Rmd format, you can press the “play” button on the upper right corner of the chunk of code (i.e., gray section) to run the code and see the results.

  2. If you are using this document in its .html format, you can copy-paste the code in an R script and see how it works.

1. Assigning and using variables

We can use the = sign to assign values to variables, but the <- operator is conventionally preferred.

# "Initializing" a variable
# Variable <- Value

A <- 1
B <- 2
C <- 3

print(A)
## [1] 1
print(B)
## [1] 2
print(C)
## [1] 3

We can perform basic math operations by using the variables instead of the numbers assigned to them:

D <- A + B
D
## [1] 3

We can also overwrite variables:

# We can also overwrite variables

C <- 2 * A * B ^ B
C
## [1] 8

Note how the value of C changes from 3 to 8, subsequently changing the results of succeeding operations:

C + D
## [1] 11


2. Data types in R

Variables do not have to be numbers.

There are multiple data types in R and in other computer languages. The standard data types we will often encounter are:

a. Numeric

# Subcategories: integer & double

E <- 525600

b. Logical

# TRUE or FALSE

Outcome <- TRUE

c. Character

# Also known as a string; written with quotation marks around it

BestProfession <- "Engineering"

We can print the variables to see the values assigned to them.

print(E)
## [1] 525600
print(Outcome)
## [1] TRUE
print(BestProfession)
## [1] "Engineering"

To find out the type of a variable, we can use class():

class(E)
## [1] "numeric"
class(Outcome)
## [1] "logical"
class(BestProfession)
## [1] "character"
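The numeric type has two subcategories, integer and double. The short sketch below shows how to tell them apart with typeof(). The variable names minutes_double and minutes_integer are made up for this illustration.

```r
# Numbers typed without a suffix are stored as doubles
minutes_double <- 525600

# The L suffix asks R to store the value as an integer
minutes_integer <- 525600L

typeof(minutes_double)    # "double"
typeof(minutes_integer)   # "integer"

# class() reports "numeric" for doubles and "integer" for integers
class(minutes_double)     # "numeric"
class(minutes_integer)    # "integer"

# Both still count as numeric values
is.numeric(minutes_integer)   # TRUE
```

For most everyday calculations the difference does not matter, since R converts between the two as needed.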

d. Factor

A factor in R is a data structure used to handle categorical variables. It is especially useful when the variable represents a fixed number of categories (example: male vs female; low-medium-high).

Factors are internally stored as integers with labels. They can be either “ordered” or “unordered”.

# Creating a factor
gender <- factor(c("Male", "Female", "Female", "Male", "Non-binary"))

# Check levels
levels(gender)
## [1] "Female"     "Male"       "Non-binary"

We can also specify the levels and orders of factors.

# Custom levels and order
level_ordered <- factor(c("Low", "High", "Medium"),
                        levels = c("Low", "Medium",
                                   "High"), 
                        ordered = TRUE)
level_ordered
## [1] Low    High   Medium
## Levels: Low < Medium < High
# Check if it's ordered
is.ordered(level_ordered)
## [1] TRUE
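To see what "stored as integers with labels" and "ordered" mean in practice, here is a small sketch. The name storage_level is hypothetical and only used for this example.

```r
storage_level <- factor(c("Low", "High", "Medium"),
                        levels = c("Low", "Medium", "High"),
                        ordered = TRUE)

# The integer codes follow the order of the levels (Low = 1, Medium = 2, High = 3)
level_codes <- as.integer(storage_level)
level_codes    # 1 3 2

# Ordered factors can be compared against a level
below_high <- storage_level < "High"
below_high     # TRUE FALSE TRUE

# max() also works on ordered factors
highest <- as.character(max(storage_level))
highest        # "High"
```

These comparisons only work on ordered factors. On an unordered factor, < returns NA with a warning.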

Converting between different types of variables

# From character to factor
x <- as.factor(c("A", "B", "A"))
x
## [1] A B A
## Levels: A B
class(x)
## [1] "factor"
# From factor to character
x <- as.character(x)
x
## [1] "A" "B" "A"
class(x)
## [1] "character"
# From character to numeric 
y <- c("1", "2", "3")
y <- as.numeric(y)
y
## [1] 1 2 3
class(y)
## [1] "numeric"
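One common pitfall when converting is going straight from a factor to numeric. Because factors are stored as integer codes, as.numeric() returns the codes, not the labels. A minimal sketch, using a made-up factor called readings:

```r
readings <- factor(c("10", "20", "10"))

# Converting directly returns the internal integer codes
codes <- as.numeric(readings)
codes     # 1 2 1

# Converting to character first recovers the actual numbers
values <- as.numeric(as.character(readings))
values    # 10 20 10
```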

EXERCISE 1: Solving problems using variables in R.

Use variables in a script to solve for the number of liters of water needed annually by a town.

  • Each person uses on average 120 liters of water per day.
  • There are 10,000 residents in the town.
  • A golf course uses on average 1,400,000 liters of water per month.
  • Presume an average month is 30 days. There are three (3) golf courses in the town.

How much water does the town use per year?

# There are many ways to approach this problem. Here's one example:

# Step 1: Define the variables

Population <- 10000 # people in the town
Population_LPD <- 120 # water consumption per person (liters per day)

Golf_course_LPM <- 1400000 # water consumption of the golf course per month (liters)
Golf_course_no <- 3 # number of golf courses in the town

Days_year <- 365 # number of days in a year
Days_month <- 30 # number of days in a month

# Step 2: Do the computation

# Compute the water consumption of:

# a. All people
People_use <- Population * Population_LPD * Days_year

# b. Golf course
Golf_use <- Golf_course_no * ((Golf_course_LPM/Days_month) * Days_year)

# c. Total use
Total_use <- People_use + Golf_use

print(Total_use)
## [1] 489100000

Can we add text (characters) to the printed output so that it provides more information?

Using the paste() function allows us to do that.

print(paste("The total water consumption in the town is", Total_use, "liters per year."))
## [1] "The total water consumption in the town is 489100000 liters per year."
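Large numbers are easier to read with thousands separators. One possible approach is to format the number before pasting it into the sentence. The names total_use, total_label, and message_text below are only for this sketch.

```r
total_use <- 489100000

# format() converts the number to text; big.mark adds the separators
total_label <- format(total_use, big.mark = ",", scientific = FALSE)
total_label    # "489,100,000"

# paste0() joins text without adding spaces, so we type the spaces ourselves
message_text <- paste0("The total water consumption in the town is ",
                       total_label, " liters per year.")
print(message_text)
```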

3. Data structures in R

We were able to assign single values to variables. But what if we have a number of related values?

Crop1 <- "rice"
Crop2 <- "corn"
Crop3 <- "sugarcane"
Crop4 <- "cassava"

It can be a bit tedious to assign each value to a different variable. Instead, what we can do is to group them:

Crops <- c("rice", "corn", "sugarcane", "cassava")
Crops
## [1] "rice"      "corn"      "sugarcane" "cassava"
# We can also use the assigned variables to group them together:

Crops2 <- c(Crop1, Crop2, Crop3, Crop4)
Crops2
## [1] "rice"      "corn"      "sugarcane" "cassava"

a. Vectors

A vector is a data structure that holds elements of the same data type.

Note the syntax for a vector: c(Item1, Item2, …), where c stands for combine (often read as concatenate), that is, linking things together in a chain or series.

# Numeric vector
v1 <- c(1, 2, 3, 4, 5)

# Character vector
v2 <- c("apple", "banana", "grapes", "cherry", "strawberry")

# Logical vector
v3 <- c(TRUE, FALSE, TRUE, FALSE, FALSE)

Other ways to create vectors:

# Sequence of numbers
v_seq <- seq(0, 50, 2) # Sequence of numbers from 0 to 50, by 2s

# Repeating values
v_rep <- rep(5, times = 4)

We can check the type and length of vectors:

length(Crops)     # number of elements
## [1] 4
typeof(Crops)     # data type
## [1] "character"
is.vector(Crops)  # TRUE if it is a vector
## [1] TRUE
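Since a vector can hold only one data type, R silently converts the elements to a common type when we mix them. The sketch below illustrates this with two made-up vectors, mixed_chr and mixed_num.

```r
# Mixing numbers, text, and logicals: everything becomes character
mixed_chr <- c(1, "a", TRUE)
mixed_chr            # "1" "a" "TRUE"
typeof(mixed_chr)    # "character"

# Mixing numbers and logicals: TRUE becomes 1 and FALSE becomes 0
mixed_num <- c(1, TRUE, FALSE)
mixed_num            # 1 1 0
typeof(mixed_num)    # "double"
```

This is worth remembering when a column of numbers unexpectedly turns out to be character: a single text entry is enough to convert the whole vector.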

a.1. Accessing values in a vector

We can use VectorName[index] to isolate the desired item. This is called “indexing”. Note that indexing in R starts at 1, not 0.

# Using the Crops vector:

Crops[1]    # Gets the first element
## [1] "rice"
Crops[4]    # Gets the fourth element
## [1] "cassava"
# Accessing multiple values in a vector

Crops[1:2]
## [1] "rice" "corn"
# Overwriting an element in a vector using indexing

Crops[3] <- "dragonfruit"
Crops
## [1] "rice"        "corn"        "dragonfruit" "cassava"
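Aside from single positions and ranges, we can also index with a set of positions, with negative numbers, or with a condition. A short sketch, using a fresh vector called crop_names so that the Crops vector above is left untouched:

```r
crop_names <- c("rice", "corn", "sugarcane", "cassava")

# Pick several positions at once
picked <- crop_names[c(1, 3)]
picked        # "rice" "sugarcane"

# A negative index drops that position
without_first <- crop_names[-1]
without_first # "corn" "sugarcane" "cassava"

# A logical condition keeps only the elements where it is TRUE
long_names <- crop_names[nchar(crop_names) > 4]
long_names    # "sugarcane" "cassava"
```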

a.2. Vector operations

R is vectorized: it is designed to perform operations on entire vectors of data at once instead of doing one element at a time.

x <- c(1, 2, 3)
y <- c(4, 5, 6)

x + y   # 5 7 9
## [1] 5 7 9
x * 2   # 2 4 6
## [1] 2 4 6
x > 2   # FALSE FALSE TRUE
## [1] FALSE FALSE  TRUE
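When the two vectors have different lengths, R repeats ("recycles") the shorter one to match the longer one. This is what happens behind x * 2 above. A small sketch with made-up vectors:

```r
long_vec  <- c(1, 2, 3, 4)
short_vec <- c(10, 20)

# short_vec is reused as 10, 20, 10, 20
recycled_sum <- long_vec + short_vec
recycled_sum   # 11 22 13 24

# Functions such as sum() and mean() also work on the whole vector at once
sum(long_vec)  # 10
mean(long_vec) # 2.5
```

If the longer length is not a multiple of the shorter one, R still computes a result but gives a warning, which is usually a sign of a mistake.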

Vectors contain groups of objects in one dimension (column or row).

Matrices contain groups of objects in two dimensions (a grid).

Arrays contain groups of objects in any number of dimensions (i.e., vectors and matrices are just specific types of an array).

b. Matrix

A matrix is a 2D structure where all elements must be of the same data type.

There are many ways to initialize a matrix.

# Creating a matrix (2D array)
# Option 1: using array() (since a matrix is an array)

m1 <- array(data = 1:10, dim = c(5, 2))
m1
##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7
## [3,]    3    8
## [4,]    4    9
## [5,]    5   10
# Option 2: using matrix()

m2 <- matrix(data = 1:10, nrow = 5, byrow = FALSE)
m2
##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7
## [3,]    3    8
## [4,]    4    9
## [5,]    5   10
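The byrow argument controls the direction in which matrix() fills the values. The sketch below compares both settings on a smaller matrix. The names by_col and by_row are only for illustration.

```r
# Default: fill column by column
by_col <- matrix(1:6, nrow = 3, byrow = FALSE)
by_col
#      [,1] [,2]
# [1,]    1    4
# [2,]    2    5
# [3,]    3    6

# byrow = TRUE: fill row by row
by_row <- matrix(1:6, nrow = 3, byrow = TRUE)
by_row
#      [,1] [,2]
# [1,]    1    2
# [2,]    3    4
# [3,]    5    6
```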

We can use class() to see the object’s class (its behavior or type as seen by users) and/or typeof() to see the internal storage type that R used for the object.

class(m1)
## [1] "matrix" "array"
typeof(m1)
## [1] "integer"
class(m2)
## [1] "matrix" "array"
typeof(m2)
## [1] "integer"

We can also create a matrix by binding vectors.

# Column-bind
m3 <- cbind(c(1,2), c(3,4))

# Row-bind
m4 <- rbind(c(1,2), c(3,4))

print(m3)
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
print(m4)
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
# Compare m3 and m4 to spot the difference: cbind() places the vectors side by side as columns, while rbind() stacks them as rows.

b.1. Accessing elements in a matrix

# Create a sample matrix

m5 <- matrix(1:20, nrow = 4)
m5
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    5    9   13   17
## [2,]    2    6   10   14   18
## [3,]    3    7   11   15   19
## [4,]    4    8   12   16   20
# Access elements in the matrix

m5[1, 2]   # Row 1, Column 2
## [1] 5
m5[ , 2]   # Entire column 2
## [1] 5 6 7 8
m5[4, ]    # Entire row 4
## [1]  4  8 12 16 20

b.2. Matrix operations

# Create matrices

m6 <- matrix(1:4, nrow = 2)
m7 <- matrix(5:8, nrow = 2)

m6
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
m7
##      [,1] [,2]
## [1,]    5    7
## [2,]    6    8
m6 + m7   # Element-wise addition
##      [,1] [,2]
## [1,]    6   10
## [2,]    8   12
m6 * m7   # Element-wise multiplication
##      [,1] [,2]
## [1,]    5   21
## [2,]   12   32
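Note that * multiplies matching elements. For matrix multiplication in the linear algebra sense, R uses the %*% operator. A short sketch on the same values, with the hypothetical names left and right:

```r
left  <- matrix(1:4, nrow = 2)
right <- matrix(5:8, nrow = 2)

# Element-wise: each cell is multiplied by the matching cell
elementwise <- left * right
elementwise
#      [,1] [,2]
# [1,]    5   21
# [2,]   12   32

# Matrix product: rows of 'left' combined with columns of 'right'
matrix_product <- left %*% right
matrix_product
#      [,1] [,2]
# [1,]   23   31
# [2,]   34   46
```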

b.3. Useful matrix functions

dim(m5)      # dimensions (number of rows, number of columns)
## [1] 4 5
nrow(m5)     # number of rows
## [1] 4
ncol(m5)     # number of columns
## [1] 5
rowSums(m5)  # sum of each row
## [1] 45 50 55 60
colSums(m5)  # sum of each column
## [1] 10 26 42 58 74
rowMeans(m5) # average of each row
## [1]  9 10 11 12
colMeans(m5) # average of each column
## [1]  2.5  6.5 10.5 14.5 18.5
t(m5)        # transpose
##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8
## [3,]    9   10   11   12
## [4,]   13   14   15   16
## [5,]   17   18   19   20

c. Data Frame

Vectors and matrices require that their elements are of the same data type.

What if we want to combine different data types?

A data frame is a 2-dimensional, table-like structure:

  • Each column is a vector (of the same length)
  • Different columns can have different data types (numeric, character, factor, etc.)

It is the most commonly used structure for data sets in R (like Excel sheets).

df <- data.frame(
  crop = c("rice", "corn", "sugarcane", "dragonfruit", "cassava"),
  weight_kg = c(100, 250, 80, 550, 150),
  days_in_storage = c(10, 15, 8, 9, 5)
)

df
##          crop weight_kg days_in_storage
## 1        rice       100              10
## 2        corn       250              15
## 3   sugarcane        80               8
## 4 dragonfruit       550               9
## 5     cassava       150               5
# To view the df in a separate tab:
View(df)

c.1. Accessing data frame elements

df$crop     # use dollar sign then the name of the column
## [1] "rice"        "corn"        "sugarcane"   "dragonfruit" "cassava"
df[1, 2]            # by row and column index
## [1] 100
df[ , "weight_kg"]  # all rows of column weight_kg
## [1] 100 250  80 550 150
df[1, ]             # entire first row
##   crop weight_kg days_in_storage
## 1 rice       100              10
# Using subset()

subset(df, weight_kg > 100)
##          crop weight_kg days_in_storage
## 2        corn       250              15
## 4 dragonfruit       550               9
## 5     cassava       150               5
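The same filtering can be done with square brackets and a logical condition, which also lets us combine conditions and pick columns in one step. This is a sketch on a copy of the data called stock, a name chosen only for this example.

```r
stock <- data.frame(
  crop = c("rice", "corn", "sugarcane", "dragonfruit", "cassava"),
  weight_kg = c(100, 250, 80, 550, 150),
  days_in_storage = c(10, 15, 8, 9, 5)
)

# Rows where weight_kg is above 100 (note the comma: rows first, then columns)
heavy <- stock[stock$weight_kg > 100, ]
nrow(heavy)   # 3

# Combine two conditions with & and keep only the crop column
heavy_fresh <- stock[stock$weight_kg > 100 & stock$days_in_storage < 10, "crop"]
heavy_fresh   # "dragonfruit" "cassava"
```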

c.2. Inspecting a data frame

# dimensions (number of rows, number of columns)
dim(df)          
## [1] 5 3
# structure
str(df)   
## 'data.frame':    5 obs. of  3 variables:
##  $ crop           : chr  "rice" "corn" "sugarcane" "dragonfruit" ...
##  $ weight_kg      : num  100 250 80 550 150
##  $ days_in_storage: num  10 15 8 9 5
# summary statistics
summary(df)
##      crop             weight_kg   days_in_storage
##  Length:5           Min.   : 80   Min.   : 5.0   
##  Class :character   1st Qu.:100   1st Qu.: 8.0   
##  Mode  :character   Median :150   Median : 9.0   
##                     Mean   :226   Mean   : 9.4   
##                     3rd Qu.:250   3rd Qu.:10.0   
##                     Max.   :550   Max.   :15.0
# column names
names(df)
## [1] "crop"            "weight_kg"       "days_in_storage"

c.3. Modifying data frames

# Adding a new column
df$storage_room <- c(1, 1, 2, 3, 4)
df
##          crop weight_kg days_in_storage storage_room
## 1        rice       100              10            1
## 2        corn       250              15            1
## 3   sugarcane        80               8            2
## 4 dragonfruit       550               9            3
## 5     cassava       150               5            4
# Renaming a column
# a. Rename a single column by name
names(df)[names(df) == "weight_kg"] <- "weight_tons"
df
##          crop weight_tons days_in_storage storage_room
## 1        rice         100              10            1
## 2        corn         250              15            1
## 3   sugarcane          80               8            2
## 4 dragonfruit         550               9            3
## 5     cassava         150               5            4
# b. Rename by column position
names(df)[2] <- "weight_kg"
df
##          crop weight_kg days_in_storage storage_room
## 1        rice       100              10            1
## 2        corn       250              15            1
## 3   sugarcane        80               8            2
## 4 dragonfruit       550               9            3
## 5     cassava       150               5            4
# c. Rename multiple columns
names(df) <- c("crop_name", "weight", "number_of_days_stored", "room_no")
df
##     crop_name weight number_of_days_stored room_no
## 1        rice    100                    10       1
## 2        corn    250                    15       1
## 3   sugarcane     80                     8       2
## 4 dragonfruit    550                     9       3
## 5     cassava    150                     5       4
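Renaming a column changes only its label, not its values. To actually convert units, we can compute a new column from an existing one, and remove a column by assigning NULL to it. A minimal sketch, using a small made-up data frame called storage:

```r
storage <- data.frame(
  crop = c("rice", "corn", "sugarcane"),
  weight_kg = c(100, 250, 80)
)

# Add a computed column: kilograms converted to metric tons
storage$weight_tons <- storage$weight_kg / 1000
storage$weight_tons   # 0.10 0.25 0.08

# Remove a column by assigning NULL to it
storage$weight_kg <- NULL
names(storage)        # "crop" "weight_tons"
```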

d. Lists

A list in R is a flexible data structure that can hold elements of different types and lengths, including:

  • vectors
  • matrices
  • data frames
  • and even other lists

They are building blocks of more complex R objects (like models).

# Creating a simple list
list1 <- list(1, "hello", TRUE, c(2, 3, 4))
list1
## [[1]]
## [1] 1
## 
## [[2]]
## [1] "hello"
## 
## [[3]]
## [1] TRUE
## 
## [[4]]
## [1] 2 3 4
# Creating a named list
list2 <- list(crop = "rice",
              weight_kg = c(100, 250, 80),
              status_in_storage = c(TRUE, FALSE, TRUE),
              room_no = "1a")
list2
## $crop
## [1] "rice"
## 
## $weight_kg
## [1] 100 250  80
## 
## $status_in_storage
## [1]  TRUE FALSE  TRUE
## 
## $room_no
## [1] "1a"

d.1. Accessing list elements

# 1. Using $ for named elements

list2$weight_kg
## [1] 100 250  80
# 2. Using double brackets [[]]

list2[[2]]                # gets the second element
## [1] 100 250  80
list2[["weight_kg"]]      # gets the "weight_kg" (which is also the second element)
## [1] 100 250  80
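Single brackets [ ] also work on lists, but they return a smaller list instead of the element itself. The sketch below shows the difference on a made-up list called harvest.

```r
harvest <- list(crop = "rice",
                weight_kg = c(100, 250, 80))

# Single brackets keep the list wrapper
as_list <- harvest["weight_kg"]
class(as_list)     # "list"

# Double brackets extract the element itself
as_vector <- harvest[["weight_kg"]]
class(as_vector)   # "numeric"

# Once extracted, we can index into the vector as usual
second_weight <- harvest[["weight_kg"]][2]
second_weight      # 250

# New elements can be added with $
harvest$room_no <- "1a"
length(harvest)    # 3
```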

4. More about data frames in R

We will focus on working with data frames, since this is the usual data structure of most data sets that we will be using in our work.

R comes with several built-in data frames that are perfect for learning, testing, and practicing data analysis.

These are preloaded with base R or available in standard packages like datasets.

# How to see all built-in data sets in R
data()

# A separate tab showing all built-in data sets will come out. View all available data.
# Let us use ChickWeight = Weight versus age of chicks on different diets

# STEP 1: Load the data set

data("ChickWeight")  # Load the data set
# STEP 2: See the documentation of the data set

?ChickWeight

# The documentation will appear in the Help tab.
# STEP 3: Explore the data

head(ChickWeight)     # shows the first few rows
##   weight Time Chick Diet
## 1     42    0     1    1
## 2     51    2     1    1
## 3     59    4     1    1
## 4     64    6     1    1
## 5     76    8     1    1
## 6     93   10     1    1
tail(ChickWeight)    # shows the last few rows
##     weight Time Chick Diet
## 573    155   12    50    4
## 574    175   14    50    4
## 575    205   16    50    4
## 576    234   18    50    4
## 577    264   20    50    4
## 578    264   21    50    4
str(ChickWeight)    # shows the structure
## Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame':   578 obs. of  4 variables:
##  $ weight: num  42 51 59 64 76 93 106 125 149 171 ...
##  $ Time  : num  0 2 4 6 8 10 12 14 16 18 ...
##  $ Chick : Ord.factor w/ 50 levels "18"<"16"<"15"<..: 15 15 15 15 15 15 15 15 15 15 ...
##  $ Diet  : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "formula")=Class 'formula'  language weight ~ Time | Chick
##   .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
##  - attr(*, "outer")=Class 'formula'  language ~Diet
##   .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
##  - attr(*, "labels")=List of 2
##   ..$ x: chr "Time"
##   ..$ y: chr "Body weight"
##  - attr(*, "units")=List of 2
##   ..$ x: chr "(days)"
##   ..$ y: chr "(gm)"
summary(ChickWeight)   # shows the summary statistics
##      weight           Time           Chick     Diet   
##  Min.   : 35.0   Min.   : 0.00   13     : 12   1:220  
##  1st Qu.: 63.0   1st Qu.: 4.00   9      : 12   2:120  
##  Median :103.0   Median :10.00   20     : 12   3:120  
##  Mean   :121.8   Mean   :10.72   10     : 12   4:118  
##  3rd Qu.:163.8   3rd Qu.:16.00   17     : 12          
##  Max.   :373.0   Max.   :21.00   19     : 12          
##                                  (Other):506

Note that summary() can give us the summary statistics of the data frame but we can also use separate functions to do this, if needed:

# mean
mean_weight <- mean(ChickWeight$weight)

# median
median_weight <- median(ChickWeight$weight)

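# mode
# Caution: mode() returns the storage mode of the object (here, "numeric"),
# not the statistical mode (the most frequent value). Base R has no built-in
# function for the statistical mode.
mode_weight <- mode(ChickWeight$weight)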

# standard deviation
sd_weight <- sd(ChickWeight$weight)

print(mean_weight)
## [1] 121.8183
print(median_weight)
## [1] 103
print(mode_weight)
## [1] "numeric"
print(sd_weight)
## [1] 71.07196
# We assign the values to variables so they are saved in our environment and we can use them later in other computations.
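Since mode() reports the storage mode rather than the most frequent value, the statistical mode has to be computed by hand. One possible helper is sketched below. The function name stat_mode is made up for this example and is not part of base R. If several values are tied, this version returns the one that appears first.

```r
stat_mode <- function(x) {
  unique_values <- unique(x)
  # Count how many times each unique value occurs
  counts <- tabulate(match(x, unique_values))
  # Return the value with the highest count
  unique_values[which.max(counts)]
}

sample_weights <- c(2, 3, 3, 5, 3, 2)
most_frequent <- stat_mode(sample_weights)
most_frequent   # 3
```

The same helper could then be applied to a column, for example stat_mode(ChickWeight$weight).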

We can also use the base R plot() function to see a matrix of scatterplots, also called a pairs plot.

# STEP 4: Explore the relationships between the numeric columns in the data frame

plot(ChickWeight)

This is a scatterplot matrix of the columns in the data frame. Each cell shows a scatterplot between two variables. Non-numeric columns, such as factors, are plotted using their integer codes.

Note that the plot in row 1, column 2 is just a mirror of row 2, column 1 (i.e., axes flipped).

What to look for in each plot:

1. Linear patterns

2. Curved patterns

3. Clusters

4. Outliers

5. Importing data sets to R using base R functions

a. CSV files

Why are CSV files preferred over Excel files?

  1. Comma-separated values (CSV) is just raw data in plain text
  2. Works across any programming language (R, Python, SQL, JavaScript, etc.)
  3. No hidden formatting, just data = reduces unexpected behavior when importing
  4. Not proprietary! We don’t need Excel or any licensed software to open or edit a CSV.

We mainly use read.csv() to bring .csv files into R.

The tricky part here is identifying the file path to the data.

# We can use getwd() to help identify our working directory, where the .csv file (preferably) should also be located.

getwd()
## [1] "/Users/amyeldalecero/R/TRAINING/Intro_to_R_training/01_Scripts/01_Rmds"
# Let us try importing the provided CSV file.
# Provide the file path to the data:

data <- read.csv("/Users/amyeldalecero/R/TRAINING/Intro_to_R_training/00_Data/data.csv", 
                     header = TRUE)

# Note that this code will not work for you because your file will have a different file path.

# You will have to revise the line of code above to make it (and the rest of the code below) work.
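To practice importing without worrying about file paths, we can write a small data frame to a temporary CSV file and read it back. This is only a sketch: the data frame toy and the path csv_path are made up, and tempfile() creates a throwaway file location that differs on every computer.

```r
toy <- data.frame(crop = c("rice", "corn"),
                  weight_kg = c(100, 250))

# Create a temporary file path ending in .csv
csv_path <- tempfile(fileext = ".csv")

# Write the data frame; row.names = FALSE avoids an extra column of row numbers
write.csv(toy, csv_path, row.names = FALSE)

# Read it back in, treating the first line as column names
imported <- read.csv(csv_path, header = TRUE)
dim(imported)     # 2 2
names(imported)   # "crop" "weight_kg"
```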

b. TXT files

We can use read.table() to import text files into R.
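As a rough sketch of how read.table() is used, the example below writes a small tab-separated text file to a temporary location and reads it back. The names toy_txt and txt_path are hypothetical. The sep argument must match the separator actually used in the file.

```r
toy_txt <- data.frame(crop = c("rice", "corn", "cassava"),
                      weight_kg = c(100, 250, 150))

txt_path <- tempfile(fileext = ".txt")

# Write a tab-separated text file without row names or quotes
write.table(toy_txt, txt_path, sep = "\t", row.names = FALSE, quote = FALSE)

# header = TRUE: the first line holds the column names
# sep = "\t": the columns are separated by tabs
imported_txt <- read.table(txt_path, header = TRUE, sep = "\t")
nrow(imported_txt)   # 3
```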

c. Excel files

Base R cannot read Excel files directly. We will need external packages like readxl or openxlsx.

Alternatively, we can save the Excel sheet as a .csv file. This can be a problem when the data is stored in multiple tabs, because each tab has to be saved as a separate file.

Important reminders:

  1. Use absolute paths, e.g., C:/Users/yourname/Documents/file.csv

  2. Or use relative paths (relative to your working directory), e.g., data/file.csv
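Paths can also be built and checked from within R. The sketch below uses file.path() to assemble a relative path and file.exists() to check it before importing. The folder and file names are placeholders, so the check will likely return FALSE on your computer.

```r
# Build a relative path from its parts; file.path() inserts the "/" for us
csv_path <- file.path("data", "file.csv")
csv_path              # "data/file.csv"

# Check whether the file can be found from the current working directory
found <- file.exists(csv_path)

# Extract only the file name from a longer path
basename(csv_path)    # "file.csv"
```

Checking with file.exists() first gives a quick yes or no answer before we call read.csv().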

6. Exploring the imported data

Once we have imported the file, the first step is to always explore it.

# Explore 'data'

dim(data)
## [1] 11914    16
str(data)
## 'data.frame':    11914 obs. of  16 variables:
##  $ Make             : chr  "BMW" "BMW" "BMW" "BMW" ...
##  $ Model            : chr  "1 Series M" "1 Series" "1 Series" "1 Series" ...
##  $ Year             : int  2011 2011 2011 2011 2011 2012 2012 2012 2012 2013 ...
##  $ Engine.Fuel.Type : chr  "premium unleaded (required)" "premium unleaded (required)" "premium unleaded (required)" "premium unleaded (required)" ...
##  $ Engine.HP        : int  335 300 300 230 230 230 300 300 230 230 ...
##  $ Engine.Cylinders : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ Transmission.Type: chr  "MANUAL" "MANUAL" "MANUAL" "MANUAL" ...
##  $ Driven_Wheels    : chr  "rear wheel drive" "rear wheel drive" "rear wheel drive" "rear wheel drive" ...
##  $ Number.of.Doors  : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ Market.Category  : chr  "Factory Tuner,Luxury,High-Performance" "Luxury,Performance" "Luxury,High-Performance" "Luxury,Performance" ...
##  $ Vehicle.Size     : chr  "Compact" "Compact" "Compact" "Compact" ...
##  $ Vehicle.Style    : chr  "Coupe" "Convertible" "Coupe" "Coupe" ...
##  $ highway.MPG      : int  26 28 28 28 28 28 26 28 28 27 ...
##  $ city.mpg         : int  19 19 20 18 18 18 17 20 18 18 ...
##  $ Popularity       : int  3916 3916 3916 3916 3916 3916 3916 3916 3916 3916 ...
##  $ MSRP             : int  46135 40650 36350 29450 34500 31200 44100 39300 36900 37200 ...
head(data)
##   Make      Model Year            Engine.Fuel.Type Engine.HP Engine.Cylinders
## 1  BMW 1 Series M 2011 premium unleaded (required)       335                6
## 2  BMW   1 Series 2011 premium unleaded (required)       300                6
## 3  BMW   1 Series 2011 premium unleaded (required)       300                6
## 4  BMW   1 Series 2011 premium unleaded (required)       230                6
## 5  BMW   1 Series 2011 premium unleaded (required)       230                6
## 6  BMW   1 Series 2012 premium unleaded (required)       230                6
##   Transmission.Type    Driven_Wheels Number.of.Doors
## 1            MANUAL rear wheel drive               2
## 2            MANUAL rear wheel drive               2
## 3            MANUAL rear wheel drive               2
## 4            MANUAL rear wheel drive               2
## 5            MANUAL rear wheel drive               2
## 6            MANUAL rear wheel drive               2
##                         Market.Category Vehicle.Size Vehicle.Style highway.MPG
## 1 Factory Tuner,Luxury,High-Performance      Compact         Coupe          26
## 2                    Luxury,Performance      Compact   Convertible          28
## 3               Luxury,High-Performance      Compact         Coupe          28
## 4                    Luxury,Performance      Compact         Coupe          28
## 5                                Luxury      Compact   Convertible          28
## 6                    Luxury,Performance      Compact         Coupe          28
##   city.mpg Popularity  MSRP
## 1       19       3916 46135
## 2       19       3916 40650
## 3       20       3916 36350
## 4       18       3916 29450
## 5       18       3916 34500
## 6       18       3916 31200
tail(data)
##          Make  Model Year               Engine.Fuel.Type Engine.HP
## 11909   Acura    ZDX 2011    premium unleaded (required)       300
## 11910   Acura    ZDX 2012    premium unleaded (required)       300
## 11911   Acura    ZDX 2012    premium unleaded (required)       300
## 11912   Acura    ZDX 2012    premium unleaded (required)       300
## 11913   Acura    ZDX 2013 premium unleaded (recommended)       300
## 11914 Lincoln Zephyr 2006               regular unleaded       221
##       Engine.Cylinders Transmission.Type     Driven_Wheels Number.of.Doors
## 11909                6         AUTOMATIC   all wheel drive               4
## 11910                6         AUTOMATIC   all wheel drive               4
## 11911                6         AUTOMATIC   all wheel drive               4
## 11912                6         AUTOMATIC   all wheel drive               4
## 11913                6         AUTOMATIC   all wheel drive               4
## 11914                6         AUTOMATIC front wheel drive               4
##                  Market.Category Vehicle.Size Vehicle.Style highway.MPG
## 11909 Crossover,Hatchback,Luxury      Midsize 4dr Hatchback          23
## 11910 Crossover,Hatchback,Luxury      Midsize 4dr Hatchback          23
## 11911 Crossover,Hatchback,Luxury      Midsize 4dr Hatchback          23
## 11912 Crossover,Hatchback,Luxury      Midsize 4dr Hatchback          23
## 11913 Crossover,Hatchback,Luxury      Midsize 4dr Hatchback          23
## 11914                     Luxury      Midsize         Sedan          26
##       city.mpg Popularity  MSRP
## 11909       16        204 50520
## 11910       16        204 46120
## 11911       16        204 56670
## 11912       16        204 50620
## 11913       16        204 50920
## 11914       17         61 28995
summary(data)
##      Make              Model                Year      Engine.Fuel.Type  
##  Length:11914       Length:11914       Min.   :1990   Length:11914      
##  Class :character   Class :character   1st Qu.:2007   Class :character  
##  Mode  :character   Mode  :character   Median :2015   Mode  :character  
##                                        Mean   :2010                     
##                                        3rd Qu.:2016                     
##                                        Max.   :2017                     
##                                                                         
##    Engine.HP      Engine.Cylinders Transmission.Type  Driven_Wheels     
##  Min.   :  55.0   Min.   : 0.000   Length:11914       Length:11914      
##  1st Qu.: 170.0   1st Qu.: 4.000   Class :character   Class :character  
##  Median : 227.0   Median : 6.000   Mode  :character   Mode  :character  
##  Mean   : 249.4   Mean   : 5.629                                        
##  3rd Qu.: 300.0   3rd Qu.: 6.000                                        
##  Max.   :1001.0   Max.   :16.000                                        
##  NA's   :69       NA's   :30                                            
##  Number.of.Doors Market.Category    Vehicle.Size       Vehicle.Style     
##  Min.   :2.000   Length:11914       Length:11914       Length:11914      
##  1st Qu.:2.000   Class :character   Class :character   Class :character  
##  Median :4.000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :3.436                                                           
##  3rd Qu.:4.000                                                           
##  Max.   :4.000                                                           
##  NA's   :6                                                               
##   highway.MPG        city.mpg        Popularity        MSRP        
##  Min.   : 12.00   Min.   :  7.00   Min.   :   2   Min.   :   2000  
##  1st Qu.: 22.00   1st Qu.: 16.00   1st Qu.: 549   1st Qu.:  21000  
##  Median : 26.00   Median : 18.00   Median :1385   Median :  29995  
##  Mean   : 26.64   Mean   : 19.73   Mean   :1555   Mean   :  40595  
##  3rd Qu.: 30.00   3rd Qu.: 22.00   3rd Qu.:2009   3rd Qu.:  42231  
##  Max.   :354.00   Max.   :137.00   Max.   :5657   Max.   :2065902  
## 
plot(data)

# See the column names
names(data)
##  [1] "Make"              "Model"             "Year"             
##  [4] "Engine.Fuel.Type"  "Engine.HP"         "Engine.Cylinders" 
##  [7] "Transmission.Type" "Driven_Wheels"     "Number.of.Doors"  
## [10] "Market.Category"   "Vehicle.Size"      "Vehicle.Style"    
## [13] "highway.MPG"       "city.mpg"          "Popularity"       
## [16] "MSRP"
# Remove duplicate rows
clean_data <- data[!duplicated(data), ]
# Note the changes in the dimensions
dim(data)
## [1] 11914    16
dim(clean_data)
## [1] 11199    16
# Remove any row with one or more NAs
clean_data_NA <- na.omit(clean_data)
dim(clean_data_NA)
## [1] 11100    16
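Before removing rows, it helps to know how many duplicates and missing values there are and where they are located. The sketch below uses a tiny made-up data frame, cars_toy, so that the results are easy to verify by eye.

```r
cars_toy <- data.frame(hp    = c(100, NA, 100, 150, 200),
                       doors = c(2, 4, 2, NA, 4))

# Count the missing values in each column
na_per_column <- colSums(is.na(cars_toy))
na_per_column        # hp: 1, doors: 1

# Flag rows that repeat an earlier row (row 3 repeats row 1)
dup_rows <- duplicated(cars_toy)
sum(dup_rows)        # 1

# Remove duplicates first, then rows with any NA
deduped  <- cars_toy[!duplicated(cars_toy), ]
complete <- na.omit(deduped)
nrow(deduped)        # 4
nrow(complete)       # 2
```

The same colSums(is.na(...)) call can be applied to the imported data to see which columns contain the NAs.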

7. Data visualization with base R

Useful references: https://r-graph-gallery.com/base-R.html and https://www.sthda.com/english/wiki/r-base-graphs

a. Basic plot

# Horsepower versus price

plot(x = clean_data_NA$Engine.HP,
     y = clean_data_NA$MSRP)

# Adding more elements to make the plot look better

plot(x = clean_data_NA$Engine.HP,
     y = clean_data_NA$MSRP,
     main = "Horsepower vs Manufacturer's Suggested Retail Price (MSRP)", # title
     xlab = "engine horsepower", # x-axis title
     ylab = "manufacturer's suggested retail price") # y-axis title

b. Boxplot

Used to visualize the distribution of a numeric variable showing its median, quartiles, range, and potential outliers.

boxplot(clean_data_NA$MSRP, 
        ylab = "price")

c. Histogram

Used to visualize the distribution of a numeric variable by dividing it into bins (intervals) and counting how many values fall into each bin.

hist(clean_data_NA$MSRP,
     main = "Histogram of price",  # title
     xlab = "Value",               # x-axis label
     ylab = "Frequency",           # y-axis label
     col = "lightblue",            # color
     border = "black",             # border color
     breaks = 5)                   # suggested number of bins (R may adjust this)
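When breaks is given as a single number, hist() treats it as a suggestion and may choose a different number of bins to get round break points. To inspect what was actually used, we can call hist() with plot = FALSE, which returns the bin information without drawing anything. A sketch on a small made-up vector called prices_toy:

```r
prices_toy <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 10)

# plot = FALSE returns the histogram details instead of drawing the plot
h <- hist(prices_toy, breaks = 5, plot = FALSE)

h$breaks   # the bin edges that R actually chose
h$counts   # how many values fall into each bin

# Every value is counted exactly once
sum(h$counts) == length(prices_toy)
```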

d. Heat map

# Convert the mtcars data set to a matrix
mtcars_matrix <- as.matrix(mtcars)
mtcars_matrix
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
# Create a heatmap

heatmap(mtcars_matrix,
        main = "Heatmap of mtcars",
        col = heat.colors(256),
        scale = "column")
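
The scale = "column" argument standardizes every column before colors are assigned, so that variables with large values (such as disp and hp) do not dominate the picture. The snippet below is a rough sketch of that step using scale(). The names cars_matrix and cars_scaled are illustrative and are not part of the original code.

```r
# Build a fresh numeric matrix from the built-in data set
cars_matrix <- as.matrix(datasets::mtcars)

# Center each column at 0 and rescale it to a standard deviation of 1
cars_scaled <- scale(cars_matrix)

# After scaling, every column has mean 0 and standard deviation 1
round(colMeans(cars_scaled), 10)
round(apply(cars_scaled, 2, sd), 10)
```

After this step the columns are directly comparable, which is why the heatmap colors reflect how far each car is from the column average rather than the raw magnitude.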

8. Installing packages in R

Packages provide additional functions, datasets, and tools that are not included in base R.

Installing Tidyverse

tidyverse is a collection of R packages designed for data science. It includes tools for data manipulation, visualization, importing, and cleaning.

install.packages("tidyverse")
# Load the library after installing the package so that we can access its functions

library(tidyverse)
## Warning: package 'tidyr' was built under R version 4.2.3
## Warning: package 'readr' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
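
Calling install.packages() every time the document runs downloads the package again even if it is already installed. One possible pattern, sketched here with a hypothetical helper named missing_packages(), is to check first and install only what is absent.

```r
# Hypothetical helper (not part of any package):
# returns the names in `pkgs` that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Base packages ship with R, so nothing is reported as missing here
missing_packages(c("stats", "utils"))

# Typical use (commented out because it needs an internet connection):
# to_install <- missing_packages("tidyverse")
# if (length(to_install) > 0) install.packages(to_install)
```

The check relies on requireNamespace(), which returns FALSE instead of raising an error when a package cannot be found.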

a. dplyr

Let us use the mtcars data set. It is a built-in dataset in R, taken from the 1974 Motor Trend magazine, that contains information about 32 car models from the 1970s.

# Check the dimensions of the data set
dim(mtcars)
## [1] 32 11
# Check the column names
colnames(mtcars)
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
# View the first few rows
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
# View the structure
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
# Get the summary statistics
summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
# Check for missing values

sum(is.na(mtcars))          # Total NA values
## [1] 0
colSums(is.na(mtcars))      # NA per column
##  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
##    0    0    0    0    0    0    0    0    0    0    0
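
Since mtcars has no missing values, both checks return zeros. To see what they report when values are missing, the sketch below works on a copy (the name cars_na is illustrative) with a few NA values inserted on purpose.

```r
# Copy the built-in data and blank out three values on purpose
cars_na <- datasets::mtcars
cars_na$mpg[c(1, 3)] <- NA
cars_na$hp[2] <- NA

sum(is.na(cars_na))      # total number of NA values: 3
colSums(is.na(cars_na))  # mpg shows 2, hp shows 1, all other columns 0
```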

a.1. Using dplyr package functions

# Use glimpse() from dplyr for a quick overview
glimpse(mtcars)
## Rows: 32
## Columns: 11
## $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
## $ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
## $ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
## $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
## $ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
## $ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…
# EXAMPLE 1: Filter cars with mpg greater than 20

mtcars %>% 
  filter(mpg > 20) %>% 
  head()
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
# %>% is the pipe operator: it passes the output of one function to the next function as its first argument, which lets us chain operations
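
To make the role of the pipe concrete, the sketch below writes the same filter-then-head operation twice: once as nested function calls and once with %>%. The names nested and piped are illustrative.

```r
library(dplyr)

# Nested calls: read from the inside out
nested <- head(filter(datasets::mtcars, mpg > 20))

# Piped calls: read from top to bottom
piped <- datasets::mtcars %>%
  filter(mpg > 20) %>%
  head()

# Both forms produce the same data frame
identical(nested, piped)
```

The piped version performs the same steps, but the order of the code matches the order in which the operations happen.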
# EXAMPLE 2: Summarize average mpg, average hp, and number of cars by number of cylinders

mtcars %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg),
            avg_hp = mean(hp),
            count = n())
## # A tibble: 3 × 4
##     cyl avg_mpg avg_hp count
##   <dbl>   <dbl>  <dbl> <int>
## 1     4    26.7   82.6    11
## 2     6    19.7  122.      7
## 3     8    15.1  209.     14
# EXAMPLE 3: Add a simple calculated column

mtcars_new <- mtcars %>%
  mutate(power_to_weight = hp / wt)

head(mtcars_new)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
##                   power_to_weight
## Mazda RX4                41.98473
## Mazda RX4 Wag            38.26087
## Datsun 710               40.08621
## Hornet 4 Drive           34.21462
## Hornet Sportabout        50.87209
## Valiant                  30.34682
# EXAMPLE 4: Add a categorical column based on conditions

mtcars_new <- mtcars %>%
  mutate(mpg_category = case_when(
    mpg >= 25 ~ "High",
    mpg >= 15 ~ "Medium",
    TRUE ~ "Low"
  ))

head(mtcars_new)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
##                   mpg_category
## Mazda RX4               Medium
## Mazda RX4 Wag           Medium
## Datsun 710              Medium
## Hornet 4 Drive          Medium
## Hornet Sportabout       Medium
## Valiant                 Medium
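
The first six rows all fall in the "Medium" category, so head() does not show how the rule behaves elsewhere. case_when() checks its conditions in order and uses the first one that is TRUE. The sketch below applies the same rule to a small test vector and then counts the categories over the full data set. The names mpg_values, labels_out, and cars_cat are illustrative.

```r
library(dplyr)

# Apply the same rule to a few hand-picked values
mpg_values <- c(10, 15, 20, 25, 30)
labels_out <- case_when(
  mpg_values >= 25 ~ "High",
  mpg_values >= 15 ~ "Medium",   # reached only when the first test fails
  TRUE ~ "Low"                   # fallback for everything else
)
labels_out
# "Low" "Medium" "Medium" "High" "High"

# Count how many cars land in each category
cars_cat <- datasets::mtcars %>%
  mutate(mpg_category = case_when(
    mpg >= 25 ~ "High",
    mpg >= 15 ~ "Medium",
    TRUE ~ "Low"
  ))
table(cars_cat$mpg_category)
```

In the full data set this gives 6 cars in "High", 21 in "Medium", and 5 in "Low".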

b. ggplot2

b.1. Basic syntax

# Basic syntax of ggplot()

ggplot(data = mtcars,          # data set
       aes(x = wt, y = mpg)) + # x and y
  geom_point()                 # geometry

# Customization

ggplot(data = mtcars,               # data set
       aes(x = wt, y = mpg,         # x and y
           color = factor(cyl))) +  # color based on variable
  geom_point(size = 3)  +           # geometry
  geom_smooth(method = "lm") +      # linear regression line
  labs(                             # labels
    title = "MPG vs Weight",
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    color = "Cylinders") +
  theme_minimal()                   # theme
## `geom_smooth()` using formula = 'y ~ x'

b.2. Histogram

# Histogram of miles per gallon (mpg)

ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10, 
                 fill = "steelblue", 
                 color = "white") +
  theme_minimal()

b.3. Boxplot

# Boxplot of mpg by cylinder

ggplot(mtcars, 
       aes(x = factor(cyl), y = mpg)) +
  geom_boxplot(fill = "orange") +
  labs(x = "Cylinders", 
       y = "Miles Per Gallon", 
       title = "MPG by Cylinder Count") +
  theme_minimal()

b.4. Barplot

# Barplot of car counts by number of gears

ggplot(mtcars, 
       aes(x = factor(gear))) +
  geom_bar(fill = "steelblue") +
  labs(title = "Count of Cars by Gear", 
       x = "Number of Gears", 
       y = "Count") +
  theme_minimal()

b.5. Faceted scatterplot

# Facet plot: scatterplot faceted by transmission type

ggplot(mtcars, 
       aes(x = wt, y = mpg)) +
  geom_point() +
  facet_wrap(~ am, labeller = labeller(am = c("0" = "Automatic", "1" = "Manual"))) +
  labs(title = "MPG vs Weight by Transmission Type") +
  theme_minimal()

b.6. Heat map

# Install additional package
install.packages("reshape2")
# Load the library
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
# Reshape the mtcars data for ggplot

# Add car names as a column
mtcars$car <- rownames(mtcars)
mtcars_melt <- melt(mtcars, id.vars = "car")

# Check the data frame
head(mtcars_melt)
##                 car variable value
## 1         Mazda RX4      mpg  21.0
## 2     Mazda RX4 Wag      mpg  21.0
## 3        Datsun 710      mpg  22.8
## 4    Hornet 4 Drive      mpg  21.4
## 5 Hornet Sportabout      mpg  18.7
## 6           Valiant      mpg  18.1
# Plot

ggplot(mtcars_melt, aes(x = variable, y = car, fill = value)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "red") +
  theme_minimal() +
  labs(title = "Heatmap of mtcars dataset", x = "Variable", y = "Car Model") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
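
Because the fill uses the raw values, columns with large numbers such as disp and hp take up most of the color range, and the other variables look almost uniformly pale. A possible variation, assuming the goal is to compare cars within each variable, is to rescale every variable to the 0 to 1 range before plotting. The names cars, cars_scaled, and scaled are illustrative.

```r
library(dplyr)
library(ggplot2)
library(reshape2)

# Work on a copy so the built-in data set stays unchanged
cars <- datasets::mtcars
cars$car <- rownames(cars)

# Long format, then min-max rescale each variable separately
cars_scaled <- melt(cars, id.vars = "car") %>%
  group_by(variable) %>%
  mutate(scaled = (value - min(value)) / (max(value) - min(value))) %>%
  ungroup()

ggplot(cars_scaled, aes(x = variable, y = car, fill = scaled)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "red") +
  theme_minimal() +
  labs(title = "Heatmap of mtcars dataset (rescaled per variable)",
       x = "Variable", y = "Car Model", fill = "Scaled value") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

With this rescaling, the darkest tile in each column marks the car with the highest value for that variable.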

b.7. More complex examples

# ggplot with multiple aesthetics

# Prepare data: convert cyl and am to factors
mtcars_plot <- mtcars %>%
  mutate(
    cyl = as.factor(cyl),
    am = factor(am, labels = c("Automatic", "Manual"))
  )

# Create complex plot
ggplot(mtcars_plot, aes(x = hp, y = mpg, color = cyl, size = wt)) +
  geom_point(alpha = 0.8) +
  geom_smooth(method = "lm", se = FALSE, linetype = "dashed", color = "black") +
  facet_wrap(~ am) +
  labs(
    title = "Fuel Efficiency vs Horsepower by Transmission Type",
    subtitle = "Point size represents vehicle weight; color represents cylinder count",
    x = "Horsepower (hp)",
    y = "Miles per Gallon (mpg)",
    color = "Cylinders",
    size = "Weight (1000 lbs)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 16),
    plot.subtitle = element_text(size = 12),
    strip.text = element_text(face = "bold")
  )
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `geom_smooth()` using formula = 'y ~ x'
## Warning: The following aesthetics were dropped during statistical transformation: size.
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?
## Warning: The following aesthetics were dropped during statistical transformation: size.
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?
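
The warnings above most likely come from mapping size = wt inside the top-level aes(). geom_smooth() inherits that mapping, but a fitted line has no per-point size, so ggplot2 drops it and warns. One way to avoid this, sketched below under the assumption that only the points should vary in size and color, is to move those mappings into geom_point(). The names cars_plot and p_fixed are illustrative.

```r
library(dplyr)
library(ggplot2)

# Same preparation as before, on a fresh copy of the data
cars_plot <- datasets::mtcars %>%
  mutate(
    cyl = as.factor(cyl),
    am = factor(am, labels = c("Automatic", "Manual"))
  )

p_fixed <- ggplot(cars_plot, aes(x = hp, y = mpg)) +
  # color and size apply to the points only
  geom_point(aes(color = cyl, size = wt), alpha = 0.8) +
  # the trend line now inherits just x and y
  geom_smooth(method = "lm", se = FALSE, linetype = "dashed", color = "black") +
  facet_wrap(~ am) +
  labs(
    x = "Horsepower (hp)",
    y = "Miles per Gallon (mpg)",
    color = "Cylinders",
    size = "Weight (1000 lbs)"
  ) +
  theme_minimal()

p_fixed
```

The plot should look the same as before, with one dashed trend line per panel.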

b.8. Interactive ggplot

# Install and load plotly
install.packages("plotly")
# Load the library
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
# Build the plot and store in a variable

mtcars_plot <- mtcars %>%
  mutate(
    cyl = as.factor(cyl),
    am = factor(am, labels = c("Automatic", "Manual"))
  )

p <- ggplot(mtcars_plot, aes(x = hp, y = mpg, color = cyl, size = wt,
                            text = paste("Model:", rownames(mtcars_plot),
                                         "<br>HP:", hp,
                                         "<br>MPG:", mpg,
                                         "<br>Weight:", wt))) +
  geom_point(alpha = 0.8) +
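  # Give the trend line its own x/y mapping: the inherited `text` aesthetic
  # would otherwise split the data into one group per car and hide the line
  geom_smooth(aes(x = hp, y = mpg), inherit.aes = FALSE,
              method = "lm", se = FALSE, linetype = "dashed", color = "black") +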
  facet_wrap(~ am) +
  labs(
    title = "Fuel Efficiency vs Horsepower by Transmission Type",
    subtitle = "Point size represents vehicle weight; color represents cylinder count",
    x = "Horsepower (hp)",
    y = "Miles per Gallon (mpg)",
    color = "Cylinders",
    size = "Weight (1000 lbs)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 16),
    plot.subtitle = element_text(size = 12),
    strip.text = element_text(face = "bold")
  )

# Print the prepared data to confirm that am now shows Automatic/Manual labels
mtcars_plot
##                      mpg cyl  disp  hp drat    wt  qsec vs        am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0    Manual    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0    Manual    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1    Manual    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1 Automatic    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0 Automatic    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1 Automatic    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0 Automatic    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1 Automatic    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1 Automatic    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1 Automatic    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1 Automatic    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0 Automatic    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0 Automatic    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0 Automatic    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0 Automatic    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0 Automatic    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0 Automatic    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1    Manual    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1    Manual    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1    Manual    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1 Automatic    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0 Automatic    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0 Automatic    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0 Automatic    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0 Automatic    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1    Manual    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0    Manual    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1    Manual    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0    Manual    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0    Manual    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0    Manual    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1    Manual    4    2
##                                     car
## Mazda RX4                     Mazda RX4
## Mazda RX4 Wag             Mazda RX4 Wag
## Datsun 710                   Datsun 710
## Hornet 4 Drive           Hornet 4 Drive
## Hornet Sportabout     Hornet Sportabout
## Valiant                         Valiant
## Duster 360                   Duster 360
## Merc 240D                     Merc 240D
## Merc 230                       Merc 230
## Merc 280                       Merc 280
## Merc 280C                     Merc 280C
## Merc 450SE                   Merc 450SE
## Merc 450SL                   Merc 450SL
## Merc 450SLC                 Merc 450SLC
## Cadillac Fleetwood   Cadillac Fleetwood
## Lincoln Continental Lincoln Continental
## Chrysler Imperial     Chrysler Imperial
## Fiat 128                       Fiat 128
## Honda Civic                 Honda Civic
## Toyota Corolla           Toyota Corolla
## Toyota Corona             Toyota Corona
## Dodge Challenger       Dodge Challenger
## AMC Javelin                 AMC Javelin
## Camaro Z28                   Camaro Z28
## Pontiac Firebird       Pontiac Firebird
## Fiat X1-9                     Fiat X1-9
## Porsche 914-2             Porsche 914-2
## Lotus Europa               Lotus Europa
## Ford Pantera L           Ford Pantera L
## Ferrari Dino               Ferrari Dino
## Maserati Bora             Maserati Bora
## Volvo 142E                   Volvo 142E
# Convert to interactive plot


ggplotly(p, tooltip = "text")
## `geom_smooth()` using formula = 'y ~ x'